Identify Diabetic Patient Readmission


OBJECTIVE : To predict the hospital readmission probability of a DIABETIC patient using appropriate Data Science techniques.

  • A hospital readmission occurs when a patient who has been discharged from the hospital is admitted again within a specified time frame. Hospital readmission rates for certain conditions are now considered an indicator of hospital quality, and they also adversely affect the cost of care.
  • Accordingly, the Centers for Medicare and Medicaid Services established the Hospital Readmissions Reduction Program, which aims to improve quality of care for patients and reduce healthcare spending by applying payment penalties to hospitals with higher-than-expected readmission rates for certain conditions.
  • In 2011, American hospitals spent over $41 billion on diabetic patients who were readmitted within 30 days of discharge. Being able to determine the factors that lead to higher readmission in such patients, and correspondingly to predict which patients will be readmitted, can help hospitals save millions of dollars while improving quality of care.

Below are the questions the analysis needs to answer:

  • What variables are the strongest predictors of hospital readmission in diabetic patients?
  • How well can we predict hospital readmission in this dataset with a limited set of features?

About DATA : The data is retrieved from the UCI (University of California, Irvine) Machine Learning Repository

  • The encounter data were collected from 130 hospitals for diabetic patients over the period 1999-2008
  • The dataset has over 50 features, including patient characteristics, conditions, tests, and 23 medications.

Why this DATA?

  • People affected by diabetes : WORLD: 425 million, USA: 26 million (8.3% of the population)
  • Expenditure on diabetes : WORLD: $727 billion, USA: $327 billion
  • People projected to be affected by diabetes : 629 million
  • Penalties paid by US hospitals due to patient readmissions : $528 million
  • Share of diabetic patients readmitted within 30 days of discharge : 20.3%

STEP: 1 - Data Cleaning

  • Loading the libraries required for the downstream analysis
In [1]:
# Core numerics
import numpy as np
import pandas as pd
import scipy as sp
from scipy.stats import skew
from scipy.stats import randint as sp_randint

# Plotting
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use('seaborn')

# Modelling
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
import statsmodels.api as stat_model

# Model selection and evaluation
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.metrics import (r2_score, f1_score, precision_score, recall_score,
                             confusion_matrix, precision_recall_curve, auc,
                             precision_recall_fscore_support)

# Preprocessing and resampling
from sklearn.preprocessing import (MinMaxScaler, StandardScaler, LabelEncoder,
                                   Normalizer, Imputer)  # note: Imputer became sklearn.impute.SimpleImputer in sklearn >= 0.22
from imblearn.over_sampling import SMOTE

from pandas.plotting import scatter_matrix
from collections import Counter

import warnings
warnings.filterwarnings('ignore')

# Display and warning settings
pd.options.mode.chained_assignment = None
pd.set_option('display.max_columns', 100)
pd.set_option('display.max_rows', 100)
  • Reading the data from the CSV file; it will be used for training and testing the model
In [2]:
data=pd.read_csv(r"challengetraining_data.csv")
print(data.shape)
data.head()
(81414, 50)
Out[2]:
encounter_id patient_nbr race gender age weight admission_type_id discharge_disposition_id admission_source_id time_in_hospital payer_code medical_specialty num_lab_procedures num_procedures num_medications number_outpatient number_emergency number_inpatient diag_1 diag_2 diag_3 number_diagnoses max_glu_serum A1Cresult metformin repaglinide nateglinide chlorpropamide glimepiride acetohexamide glipizide glyburide tolbutamide pioglitazone rosiglitazone acarbose miglitol troglitazone tolazamide examide citoglipton insulin glyburide-metformin glipizide-metformin glimepiride-pioglitazone metformin-rosiglitazone metformin-pioglitazone change diabetesMed readmitted
0 2278392 8222157 Caucasian Female [0-10) ? 6 25 1 1 ? Pediatrics-Endocrinology 41 0 1 0 0 0 250.83 ? ? 1 None None No No No No No No No No No No No No No No No No No No No No No No No No No N
1 149190 55629189 Caucasian Female [10-20) ? 1 1 7 3 ? ? 59 0 18 0 0 0 276 250.01 255 9 None None No No No No No No No No No No No No No No No No No Up No No No No No Ch Yes N
2 64410 86047875 AfricanAmerican Female [20-30) ? 1 1 7 2 ? ? 11 5 13 2 0 1 648 250 V27 6 None None No No No No No No Steady No No No No No No No No No No No No No No No No No Yes N
3 500364 82442376 Caucasian Male [30-40) ? 1 1 7 2 ? ? 44 1 16 0 0 0 8 250.43 403 7 None None No No No No No No No No No No No No No No No No No Up No No No No No Ch Yes N
4 16680 42519267 Caucasian Male [40-50) ? 1 1 7 1 ? ? 51 0 8 0 0 0 197 157 250 5 None None No No No No No No Steady No No No No No No No No No No Steady No No No No No Ch Yes N
  • Checking for null values in the loaded data; no null values are found
In [3]:
data.isnull().sum()
Out[3]:
encounter_id                0
patient_nbr                 0
race                        0
gender                      0
age                         0
weight                      0
admission_type_id           0
discharge_disposition_id    0
admission_source_id         0
time_in_hospital            0
payer_code                  0
medical_specialty           0
num_lab_procedures          0
num_procedures              0
num_medications             0
number_outpatient           0
number_emergency            0
number_inpatient            0
diag_1                      0
diag_2                      0
diag_3                      0
number_diagnoses            0
max_glu_serum               0
A1Cresult                   0
metformin                   0
repaglinide                 0
nateglinide                 0
chlorpropamide              0
glimepiride                 0
acetohexamide               0
glipizide                   0
glyburide                   0
tolbutamide                 0
pioglitazone                0
rosiglitazone               0
acarbose                    0
miglitol                    0
troglitazone                0
tolazamide                  0
examide                     0
citoglipton                 0
insulin                     0
glyburide-metformin         0
glipizide-metformin         0
glimepiride-pioglitazone    0
metformin-rosiglitazone     0
metformin-pioglitazone      0
change                      0
diabetesMed                 0
readmitted                  0
dtype: int64
  • Displaying the unique values of each feature and their relative frequencies, to identify non-standard encodings of missing values and how much variation each variable carries.
  • The features 'weight', 'payer_code' and 'medical_specialty' each have approximately 40% or more of their values missing, represented by '?'
  • In the 'gender' variable, missing values appear as 'Unknown/Invalid' (only 2 records in this data)
In [4]:
for col in data:
    print('Column name : {}\n{}'.format(col, data[col].value_counts()/len(data)))
Column name : encounter_id
116046918    0.000012
74824098     0.000012
84657354     0.000012
172755348    0.000012
211901586    0.000012
               ...   
276702294    0.000012
356032316    0.000012
129070404    0.000012
199469046    0.000012
77856768     0.000012
Name: encounter_id, Length: 81414, dtype: float64
Column name : patient_nbr
88785891    0.000405
43140906    0.000307
88227540    0.000270
1660293     0.000258
88789707    0.000233
              ...   
86882463    0.000012
47165598    0.000012
78618780    0.000012
24246423    0.000012
93251151    0.000012
Name: patient_nbr, Length: 60057, dtype: float64
Column name : race
Caucasian          0.747316
AfricanAmerican    0.189316
?                  0.022269
Hispanic           0.019861
Other              0.014936
Asian              0.006301
Name: race, dtype: float64
Column name : gender
Female             0.537401
Male               0.462574
Unknown/Invalid    0.000025
Name: gender, dtype: float64
Column name : age
[70-80)     0.256590
[60-70)     0.221817
[80-90)     0.169504
[50-60)     0.168718
[40-50)     0.094677
[30-40)     0.037082
[90-100)    0.027219
[20-30)     0.016115
[10-20)     0.006719
[0-10)      0.001560
Name: age, dtype: float64
Column name : weight
?            0.968433
[75-100)     0.013204
[50-75)      0.008819
[100-125)    0.006252
[125-150)    0.001474
[25-50)      0.000860
[0-25)       0.000479
[150-175)    0.000356
[175-200)    0.000098
>200         0.000025
Name: weight, dtype: float64
Column name : admission_type_id
1    0.532058
3    0.184526
2    0.181222
6    0.051846
5    0.046822
8    0.003230
7    0.000209
4    0.000086
Name: admission_type_id, dtype: float64
Column name : discharge_disposition_id
1     0.591987
3     0.137225
6     0.126821
18    0.036075
2     0.021041
22    0.019874
11    0.016373
5     0.011534
25    0.009495
4     0.007787
7     0.006252
13    0.003931
23    0.003918
14    0.003476
28    0.001277
8     0.001044
15    0.000688
24    0.000528
9     0.000197
16    0.000135
17    0.000135
19    0.000074
10    0.000061
12    0.000037
27    0.000025
20    0.000012
Name: discharge_disposition_id, dtype: float64
Column name : admission_source_id
7     0.566487
1     0.289729
17    0.065996
4     0.031297
6     0.022355
2     0.010662
5     0.008328
3     0.001879
20    0.001572
9     0.001241
8     0.000160
22    0.000123
10    0.000098
11    0.000025
25    0.000025
13    0.000012
14    0.000012
Name: admission_source_id, dtype: float64
Column name : time_in_hospital
3     0.174724
2     0.169627
1     0.138760
4     0.137053
5     0.098104
6     0.073918
7     0.057165
8     0.043273
9     0.029823
10    0.023055
11    0.017921
12    0.014322
13    0.011964
14    0.010293
Name: time_in_hospital, dtype: float64
Column name : payer_code
?     0.395890
MC    0.318680
HM    0.061292
SP    0.048874
BC    0.045692
MD    0.034404
CP    0.024910
UN    0.024455
CM    0.019210
OG    0.010232
PO    0.005859
DM    0.005429
CH    0.001413
WC    0.001314
OT    0.000983
MP    0.000798
SI    0.000553
FR    0.000012
Name: payer_code, dtype: float64
Column name : medical_specialty
?                                       0.490518
InternalMedicine                        0.144619
Emergency/Trauma                        0.074594
Family/GeneralPractice                  0.072764
Cardiology                              0.052264
Surgery-General                         0.030597
Nephrology                              0.015599
Orthopedics                             0.013634
Orthopedics-Reconstructive              0.011890
Radiologist                             0.011300
Pulmonology                             0.008721
Psychiatry                              0.008377
Urology                                 0.006706
ObstetricsandGynecology                 0.006670
Surgery-Cardiovascular/Thoracic         0.006424
Gastroenterology                        0.005699
Surgery-Vascular                        0.005073
Surgery-Neuro                           0.004569
PhysicalMedicineandRehabilitation       0.003967
Oncology                                0.003402
Pediatrics                              0.002493
Hematology/Oncology                     0.002137
Neurology                               0.002002
Pediatrics-Endocrinology                0.001499
Otolaryngology                          0.001216
Endocrinology                           0.001142
Surgery-Thoracic                        0.001093
Surgery-Cardiovascular                  0.001007
Podiatry                                0.000970
Psychology                              0.000934
Pediatrics-CriticalCare                 0.000884
Hematology                              0.000835
Gynecology                              0.000577
Hospitalist                             0.000528
Radiology                               0.000504
Surgeon                                 0.000479
Surgery-Plastic                         0.000393
Osteopath                               0.000393
Ophthalmology                           0.000368
InfectiousDiseases                      0.000332
SurgicalSpecialty                       0.000295
Obsterics&Gynecology-GynecologicOnco    0.000258
Pediatrics-Pulmonology                  0.000233
Rheumatology                            0.000172
Anesthesiology-Pediatric                0.000172
Obstetrics                              0.000172
Pathology                               0.000160
PhysicianNotFound                       0.000135
Surgery-Colon&Rectal                    0.000123
OutreachServices                        0.000123
Pediatrics-Neurology                    0.000123
Anesthesiology                          0.000111
Surgery-Maxillofacial                   0.000098
AllergyandImmunology                    0.000086
Psychiatry-Child/Adolescent             0.000086
Endocrinology-Metabolism                0.000074
Cardiology-Pediatric                    0.000074
Surgery-Pediatric                       0.000061
DCPTEAM                                 0.000049
Dentistry                               0.000037
Pediatrics-EmergencyMedicine            0.000025
Pediatrics-Hematology-Oncology          0.000025
Pediatrics-AllergyandImmunology         0.000025
Resident                                0.000025
Neurophysiology                         0.000012
Pediatrics-InfectiousDiseases           0.000012
Speech                                  0.000012
Dermatology                             0.000012
Psychiatry-Addictive                    0.000012
SportsMedicine                          0.000012
Perinatology                            0.000012
Name: medical_specialty, dtype: float64
Column name : num_lab_procedures
1      0.031444
43     0.027170
44     0.024173
45     0.023387
40     0.021741
         ...   
129    0.000012
118    0.000012
109    0.000012
120    0.000012
132    0.000012
Name: num_lab_procedures, Length: 115, dtype: float64
Column name : num_procedures
0    0.458987
1    0.203675
2    0.124524
3    0.092662
6    0.048530
4    0.041295
5    0.030326
Name: num_procedures, dtype: float64
Column name : num_medications
13    0.059584
12    0.058577
11    0.056649
15    0.056575
14    0.056501
16    0.053566
10    0.052988
9     0.048431
17    0.048137
18    0.044452
8     0.042953
19    0.039711
20    0.036529
7     0.034257
21    0.031862
22    0.027993
6     0.026494
23    0.023878
24    0.020967
5     0.019579
25    0.018338
26    0.015612
27    0.014101
4     0.013855
28    0.012590
29    0.009679
3     0.008844
30    0.008377
31    0.007112
32    0.006277
33    0.004913
34    0.004520
2     0.004496
35    0.003943
37    0.002850
36    0.002813
1     0.002629
38    0.002383
39    0.001978
40    0.001818
41    0.001413
42    0.001228
43    0.001228
44    0.001007
46    0.000934
45    0.000860
47    0.000737
48    0.000626
49    0.000565
52    0.000528
50    0.000516
56    0.000418
51    0.000381
53    0.000344
55    0.000319
54    0.000307
60    0.000246
57    0.000221
58    0.000209
59    0.000197
62    0.000160
61    0.000135
65    0.000111
63    0.000098
68    0.000086
64    0.000074
67    0.000061
66    0.000049
69    0.000037
72    0.000037
70    0.000012
74    0.000012
75    0.000012
79    0.000012
81    0.000012
Name: num_medications, dtype: float64
Column name : number_outpatient
0     0.835962
1     0.084027
2     0.035068
3     0.019861
4     0.010748
5     0.005478
6     0.002850
7     0.001560
8     0.000970
9     0.000823
10    0.000540
11    0.000381
13    0.000319
12    0.000270
14    0.000258
15    0.000246
16    0.000160
17    0.000074
20    0.000049
21    0.000049
19    0.000037
18    0.000037
22    0.000025
23    0.000025
24    0.000025
36    0.000025
29    0.000025
40    0.000012
34    0.000012
37    0.000012
35    0.000012
26    0.000012
28    0.000012
27    0.000012
25    0.000012
42    0.000012
Name: number_outpatient, dtype: float64
Column name : number_emergency
0     0.888287
1     0.075122
2     0.020193
3     0.006977
4     0.003795
5     0.001904
6     0.000934
7     0.000725
8     0.000516
10    0.000319
9     0.000270
11    0.000233
13    0.000135
12    0.000111
16    0.000061
22    0.000061
18    0.000049
19    0.000049
14    0.000037
15    0.000037
20    0.000025
21    0.000025
76    0.000012
54    0.000012
24    0.000012
25    0.000012
28    0.000012
29    0.000012
37    0.000012
42    0.000012
46    0.000012
64    0.000012
63    0.000012
Name: number_emergency, dtype: float64
Column name : number_inpatient
0     0.663387
1     0.193394
2     0.074520
3     0.033004
4     0.015992
5     0.007873
6     0.004753
7     0.002579
8     0.001511
9     0.001032
10    0.000590
11    0.000479
12    0.000393
13    0.000184
15    0.000098
14    0.000086
16    0.000074
19    0.000025
17    0.000012
21    0.000012
Name: number_inpatient, dtype: float64
Column name : diag_1
428    0.066610
414    0.063982
786    0.039612
410    0.035522
486    0.034810
         ...   
391    0.000012
314    0.000012
797    0.000012
336    0.000012
405    0.000012
Name: diag_1, Length: 697, dtype: float64
Column name : diag_2
276    0.066598
428    0.065320
250    0.059953
427    0.049230
401    0.036578
         ...   
977    0.000012
975    0.000012
374    0.000012
615    0.000012
V13    0.000012
Name: diag_2, Length: 722, dtype: float64
Column name : diag_3
250     0.114108
401     0.081190
276     0.051134
428     0.045078
427     0.038753
          ...   
395     0.000012
884     0.000012
E922    0.000012
538     0.000012
944     0.000012
Name: diag_3, Length: 755, dtype: float64
Column name : number_diagnoses
9     0.486415
5     0.112155
8     0.104184
7     0.101752
6     0.099553
4     0.054991
3     0.027808
2     0.009851
1     0.002186
16    0.000393
10    0.000172
13    0.000147
11    0.000123
15    0.000111
12    0.000098
14    0.000061
Name: number_diagnoses, dtype: float64
Column name : max_glu_serum
None    0.947822
Norm    0.025069
>200    0.014690
>300    0.012418
Name: max_glu_serum, dtype: float64
Column name : A1Cresult
None    0.832314
>8      0.080993
Norm    0.049267
>7      0.037426
Name: A1Cresult, dtype: float64
Column name : metformin
No        0.804542
Steady    0.179515
Up        0.010330
Down      0.005613
Name: metformin, dtype: float64
Column name : repaglinide
No        0.984929
Steady    0.013462
Up        0.001179
Down      0.000430
Name: repaglinide, dtype: float64
Column name : nateglinide
No        0.992999
Steady    0.006633
Up        0.000258
Down      0.000111
Name: nateglinide, dtype: float64
Column name : chlorpropamide
No        0.999091
Steady    0.000823
Up        0.000074
Down      0.000012
Name: chlorpropamide, dtype: float64
Column name : glimepiride
No        0.949210
Steady    0.045447
Up        0.003378
Down      0.001965
Name: glimepiride, dtype: float64
Column name : acetohexamide
No        0.999988
Steady    0.000012
Name: acetohexamide, dtype: float64
Column name : glipizide
No        0.874002
Steady    0.113138
Up        0.007468
Down      0.005392
Name: glipizide, dtype: float64
Column name : glyburide
No        0.895780
Steady    0.090623
Up        0.008070
Down      0.005527
Name: glyburide, dtype: float64
Column name : tolbutamide
No        0.999803
Steady    0.000197
Name: tolbutamide, dtype: float64
Column name : pioglitazone
No        0.927457
Steady    0.068981
Up        0.002346
Down      0.001216
Name: pioglitazone, dtype: float64
Column name : rosiglitazone
No        0.937357
Steady    0.060076
Up        0.001806
Down      0.000762
Name: rosiglitazone, dtype: float64
Column name : acarbose
No        0.996929
Steady    0.002923
Up        0.000111
Down      0.000037
Name: acarbose, dtype: float64
Column name : miglitol
No        0.999607
Steady    0.000332
Down      0.000049
Up        0.000012
Name: miglitol, dtype: float64
Column name : troglitazone
No        0.999975
Steady    0.000025
Name: troglitazone, dtype: float64
Column name : tolazamide
No        0.999681
Steady    0.000319
Name: tolazamide, dtype: float64
Column name : examide
No    1.0
Name: examide, dtype: float64
Column name : citoglipton
No    1.0
Name: citoglipton, dtype: float64
Column name : insulin
No        0.464330
Steady    0.303167
Down      0.120876
Up        0.111627
Name: insulin, dtype: float64
Column name : glyburide-metformin
No        0.993122
Steady    0.006743
Up        0.000098
Down      0.000037
Name: glyburide-metformin, dtype: float64
Column name : glipizide-metformin
No        0.999889
Steady    0.000111
Name: glipizide-metformin, dtype: float64
Column name : glimepiride-pioglitazone
No        0.999988
Steady    0.000012
Name: glimepiride-pioglitazone, dtype: float64
Column name : metformin-rosiglitazone
No        0.999975
Steady    0.000025
Name: metformin-rosiglitazone, dtype: float64
Column name : metformin-pioglitazone
No        0.999988
Steady    0.000012
Name: metformin-pioglitazone, dtype: float64
Column name : change
No    0.537475
Ch    0.462525
Name: change, dtype: float64
Column name : diabetesMed
Yes    0.770543
No     0.229457
Name: diabetesMed, dtype: float64
Column name : readmitted
N    0.888398
Y    0.111602
Name: readmitted, dtype: float64
  • Since approximately 40% or more of their values are missing, dropping the columns 'weight', 'payer_code' and 'medical_specialty'
  • Replacing the remaining '?' placeholders with NaN, recoding 'Unknown/Invalid' in 'gender' as NaN, and dropping all records that still contain missing values
In [5]:
data=data.drop(['weight', 'payer_code', 'medical_specialty'], axis=1)
data=data.replace('?',np.nan)
data['gender']=data['gender'].replace('Unknown/Invalid',np.nan)
data=data.dropna()
print(data.shape)
data.head()
(78465, 47)
Out[5]:
encounter_id patient_nbr race gender age admission_type_id discharge_disposition_id admission_source_id time_in_hospital num_lab_procedures num_procedures num_medications number_outpatient number_emergency number_inpatient diag_1 diag_2 diag_3 number_diagnoses max_glu_serum A1Cresult metformin repaglinide nateglinide chlorpropamide glimepiride acetohexamide glipizide glyburide tolbutamide pioglitazone rosiglitazone acarbose miglitol troglitazone tolazamide examide citoglipton insulin glyburide-metformin glipizide-metformin glimepiride-pioglitazone metformin-rosiglitazone metformin-pioglitazone change diabetesMed readmitted
1 149190 55629189 Caucasian Female [10-20) 1 1 7 3 59 0 18 0 0 0 276 250.01 255 9 None None No No No No No No No No No No No No No No No No No Up No No No No No Ch Yes N
2 64410 86047875 AfricanAmerican Female [20-30) 1 1 7 2 11 5 13 2 0 1 648 250 V27 6 None None No No No No No No Steady No No No No No No No No No No No No No No No No No Yes N
3 500364 82442376 Caucasian Male [30-40) 1 1 7 2 44 1 16 0 0 0 8 250.43 403 7 None None No No No No No No No No No No No No No No No No No Up No No No No No Ch Yes N
4 16680 42519267 Caucasian Male [40-50) 1 1 7 1 51 0 8 0 0 0 197 157 250 5 None None No No No No No No Steady No No No No No No No No No No Steady No No No No No Ch Yes N
5 35754 82637451 Caucasian Male [50-60) 2 1 2 3 31 6 16 0 0 0 414 411 250 9 None None No No No No No No No No No No No No No No No No No Steady No No No No No No Yes N

STEP: 2 - Feature Engineering and Feature Creation

  • Number of medications used : the total number of diabetes medications the patient is on, created by counting the drugs used during the encounter
  • Number of medication changes : the data contains 23 columns, one per medication, recording whether a dosage change was made for that drug, so we count how many changes were made in total for each patient
  • Service utilization : the data already contains variables for the number of inpatient stays, emergency-room visits and outpatient visits
In [6]:
drugs =  ['metformin', 'repaglinide', 'nateglinide', 'chlorpropamide', 'glimepiride', 'acetohexamide', 'glipizide', 
        'glyburide', 'tolbutamide', 'pioglitazone', 'rosiglitazone', 'acarbose', 'miglitol', 'troglitazone', 'tolazamide', 
        'insulin', 'glyburide-metformin', 'glipizide-metformin', 'glimepiride-pioglitazone', 'metformin-rosiglitazone', 
        'metformin-pioglitazone', 'citoglipton', 'examide']

for col in drugs:
    colname = str(col) + 'temp'
    data[colname] = data[col].apply(lambda x: 0 if (x == 'No' or x == 'Steady') else 1)
    
data['numchange'] = 0
for col in drugs:
    colname = str(col) + 'temp'
    data['numchange'] = data['numchange'] + data[colname]
    del data[colname]
    
for col in drugs:
    data[col] = data[col].replace('No', 0)
    data[col] = data[col].replace('Steady', 1)
    data[col] = data[col].replace('Up', 1)
    data[col] = data[col].replace('Down', 1) 

data['nummed'] = 0
for col in drugs:
    data['nummed'] = data['nummed'] + data[col]

print(data.shape)
data.head()
(78465, 49)
Out[6]:
encounter_id patient_nbr race gender age admission_type_id discharge_disposition_id admission_source_id time_in_hospital num_lab_procedures num_procedures num_medications number_outpatient number_emergency number_inpatient diag_1 diag_2 diag_3 number_diagnoses max_glu_serum A1Cresult metformin repaglinide nateglinide chlorpropamide glimepiride acetohexamide glipizide glyburide tolbutamide pioglitazone rosiglitazone acarbose miglitol troglitazone tolazamide examide citoglipton insulin glyburide-metformin glipizide-metformin glimepiride-pioglitazone metformin-rosiglitazone metformin-pioglitazone change diabetesMed readmitted numchange nummed
1 149190 55629189 Caucasian Female [10-20) 1 1 7 3 59 0 18 0 0 0 276 250.01 255 9 None None 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 Ch Yes N 1 1
2 64410 86047875 AfricanAmerican Female [20-30) 1 1 7 2 11 5 13 2 0 1 648 250 V27 6 None None 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 No Yes N 0 1
3 500364 82442376 Caucasian Male [30-40) 1 1 7 2 44 1 16 0 0 0 8 250.43 403 7 None None 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 Ch Yes N 1 1
4 16680 42519267 Caucasian Male [40-50) 1 1 7 1 51 0 8 0 0 0 197 157 250 5 None None 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 Ch Yes N 0 2
5 35754 82637451 Caucasian Male [50-60) 2 1 2 3 31 6 16 0 0 0 414 411 250 9 None None 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 No Yes N 0 1
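The same two features can be derived without the explicit loops. A vectorized sketch on a toy table standing in for `data[drugs]` (drug names from the dataset; values illustrative):

```python
import pandas as pd

# Toy encounter table with raw string-coded medication columns.
drugs = ["metformin", "insulin"]
data = pd.DataFrame({
    "metformin": ["No", "Steady", "Up"],
    "insulin":   ["Down", "No", "Steady"],
})

# numchange: count of dosage changes ('Up' or 'Down') across all drugs.
data["numchange"] = data[drugs].isin(["Up", "Down"]).sum(axis=1)

# nummed: count of drugs actually used (any value other than 'No').
data["nummed"] = (data[drugs] != "No").sum(axis=1)

print(data[["numchange", "nummed"]].values.tolist())  # [[1, 1], [0, 1], [1, 2]]
```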
  • Encoded the medication 'change' feature from 'No' (no change) and 'Ch' (changed) into 0 and 1
  • Similarly, converted the 'gender' variable from 'Male' and 'Female' into 1 and 0
  • Also changed the 'diabetesMed' (diabetes medication prescribed) column from 'Yes' and 'No' into 1 and 0
In [7]:
data['change'] = data['change'].replace('Ch', 1)
data['change'] = data['change'].replace('No', 0)
data['gender'] = data['gender'].replace('Male', 1)
data['gender'] = data['gender'].replace('Female', 0)
data['diabetesMed'] = data['diabetesMed'].replace('Yes', 1)
data['diabetesMed'] = data['diabetesMed'].replace('No', 0)

print(data.shape)
data.head()
(78465, 49)
Out[7]:
encounter_id patient_nbr race gender age admission_type_id discharge_disposition_id admission_source_id time_in_hospital num_lab_procedures num_procedures num_medications number_outpatient number_emergency number_inpatient diag_1 diag_2 diag_3 number_diagnoses max_glu_serum A1Cresult metformin repaglinide nateglinide chlorpropamide glimepiride acetohexamide glipizide glyburide tolbutamide pioglitazone rosiglitazone acarbose miglitol troglitazone tolazamide examide citoglipton insulin glyburide-metformin glipizide-metformin glimepiride-pioglitazone metformin-rosiglitazone metformin-pioglitazone change diabetesMed readmitted numchange nummed
1 149190 55629189 Caucasian 0 [10-20) 1 1 7 3 59 0 18 0 0 0 276 250.01 255 9 None None 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 1 N 1 1
2 64410 86047875 AfricanAmerican 0 [20-30) 1 1 7 2 11 5 13 2 0 1 648 250 V27 6 None None 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 N 0 1
3 500364 82442376 Caucasian 1 [30-40) 1 1 7 2 44 1 16 0 0 0 8 250.43 403 7 None None 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 1 N 1 1
4 16680 42519267 Caucasian 1 [40-50) 1 1 7 1 51 0 8 0 0 0 197 157 250 5 None None 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 1 N 0 2
5 35754 82637451 Caucasian 1 [50-60) 2 1 2 3 31 6 16 0 0 0 414 411 250 9 None None 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 N 0 1
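The chained replace() calls above can equivalently be written as one explicit mapping per column. A sketch on toy rows (values illustrative):

```python
import pandas as pd

# Toy rows with the three raw string-coded columns.
data = pd.DataFrame({
    "change":      ["Ch", "No"],
    "gender":      ["Male", "Female"],
    "diabetesMed": ["Yes", "No"],
})

# One mapping per column instead of two replace() calls each.
binary_maps = {
    "change":      {"Ch": 1, "No": 0},
    "gender":      {"Male": 1, "Female": 0},
    "diabetesMed": {"Yes": 1, "No": 0},
}
for col, mapping in binary_maps.items():
    data[col] = data[col].map(mapping)

print(data.values.tolist())  # [[1, 1, 1], [0, 0, 0]]
```

Unlike replace(), map() turns any value missing from the mapping into NaN, which makes unexpected categories easy to spot.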
  • The dataset has multiple encounters for some patients, which introduces bias. So, keeping only the first encounter per patient and removing the others, using the 'patient_nbr' column
In [8]:
data = data.drop_duplicates(subset= ['patient_nbr'], keep = 'first')

print(data.shape)
data.head()
(57678, 49)
Out[8]:
encounter_id patient_nbr race gender age admission_type_id discharge_disposition_id admission_source_id time_in_hospital num_lab_procedures num_procedures num_medications number_outpatient number_emergency number_inpatient diag_1 diag_2 diag_3 number_diagnoses max_glu_serum A1Cresult metformin repaglinide nateglinide chlorpropamide glimepiride acetohexamide glipizide glyburide tolbutamide pioglitazone rosiglitazone acarbose miglitol troglitazone tolazamide examide citoglipton insulin glyburide-metformin glipizide-metformin glimepiride-pioglitazone metformin-rosiglitazone metformin-pioglitazone change diabetesMed readmitted numchange nummed
1 149190 55629189 Caucasian 0 [10-20) 1 1 7 3 59 0 18 0 0 0 276 250.01 255 9 None None 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 1 N 1 1
2 64410 86047875 AfricanAmerican 0 [20-30) 1 1 7 2 11 5 13 2 0 1 648 250 V27 6 None None 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 N 0 1
3 500364 82442376 Caucasian 1 [30-40) 1 1 7 2 44 1 16 0 0 0 8 250.43 403 7 None None 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 1 N 1 1
4 16680 42519267 Caucasian 1 [40-50) 1 1 7 1 51 0 8 0 0 0 197 157 250 5 None None 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 1 N 0 2
5 35754 82637451 Caucasian 1 [50-60) 2 1 2 3 31 6 16 0 0 0 414 411 250 9 None None 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 N 0 1
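drop_duplicates with keep='first' retains the first row per key in the frame's current order and drops the later ones. A toy illustration (values hypothetical):

```python
import pandas as pd

# Toy frame: patient 101 has two encounters.
df = pd.DataFrame({
    "patient_nbr":      [101, 101, 202],
    "time_in_hospital": [3, 5, 2],
})

# Keep the first row per patient; later duplicates are dropped.
first_visits = df.drop_duplicates(subset=["patient_nbr"], keep="first")

print(first_visits["time_in_hospital"].tolist())  # [3, 2]
```

Note this only keeps the chronologically first encounter if the frame is already sorted in encounter order.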
  • 'diag_1', 'diag_2' and 'diag_3' have roughly 700-900 categories each; encoding them as 700-900 dummy variables would make the model complex and slow to train
  • 'encounter_id' and 'patient_nbr' are identifier columns that merely label each record uniquely and explain little of the variance in the data
  • Based on the above insights, dropping all 5 columns to remove the redundant part of the data
In [9]:
data = data.drop(columns=['diag_1','diag_2','diag_3','encounter_id','patient_nbr'],axis=1)

print(data.shape)
data.head()
(57678, 44)
Out[9]:
race gender age admission_type_id discharge_disposition_id admission_source_id time_in_hospital num_lab_procedures num_procedures num_medications number_outpatient number_emergency number_inpatient number_diagnoses max_glu_serum A1Cresult metformin repaglinide nateglinide chlorpropamide glimepiride acetohexamide glipizide glyburide tolbutamide pioglitazone rosiglitazone acarbose miglitol troglitazone tolazamide examide citoglipton insulin glyburide-metformin glipizide-metformin glimepiride-pioglitazone metformin-rosiglitazone metformin-pioglitazone change diabetesMed readmitted numchange nummed
1 Caucasian 0 [10-20) 1 1 7 3 59 0 18 0 0 0 9 None None 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 1 N 1 1
2 AfricanAmerican 0 [20-30) 1 1 7 2 11 5 13 2 0 1 6 None None 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 N 0 1
3 Caucasian 1 [30-40) 1 1 7 2 44 1 16 0 0 0 7 None None 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 1 N 1 1
4 Caucasian 1 [40-50) 1 1 7 1 51 0 8 0 0 0 5 None None 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 1 N 0 2
5 Caucasian 1 [50-60) 2 1 2 3 31 6 16 0 0 0 9 None None 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 N 0 1
  • Converting the remaining categorical variables into numbers: 'readmitted' to 1/0, and 'A1Cresult' and 'max_glu_serum' to 1 (high), 0 (normal) and -99 (not measured)
  • The 'age' feature has 10 bucketed categories; using the upper bound of each bucket, e.g. 10 for the category '[0-10)'
In [10]:
data['age'] = data.age.map({'[0-10)':10,'[10-20)':20, '[20-30)':30, '[30-40)':40, '[40-50)':50,'[50-60)':60, '[60-70)':70, '[70-80)':80, '[80-90)':90,'[90-100)':100})
data['age'] = data['age'].astype('int64')
data['readmitted'] = data['readmitted'].map({'Y':1,'N':0})
data['A1Cresult'] = data['A1Cresult'].map({'None':-99,'>7':1, '>8':1, 'Norm':0})
data['max_glu_serum'] = data['max_glu_serum'].map({'None':-99,'>300':1,'>200':1,'Norm':0 })

print(data.shape)
data.head()
(57678, 44)
Out[10]:
race gender age admission_type_id discharge_disposition_id admission_source_id time_in_hospital num_lab_procedures num_procedures num_medications number_outpatient number_emergency number_inpatient number_diagnoses max_glu_serum A1Cresult metformin repaglinide nateglinide chlorpropamide glimepiride acetohexamide glipizide glyburide tolbutamide pioglitazone rosiglitazone acarbose miglitol troglitazone tolazamide examide citoglipton insulin glyburide-metformin glipizide-metformin glimepiride-pioglitazone metformin-rosiglitazone metformin-pioglitazone change diabetesMed readmitted numchange nummed
1 Caucasian 0 20 1 1 7 3 59 0 18 0 0 0 9 -99 -99 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 1 0 1 1
2 AfricanAmerican 0 30 1 1 7 2 11 5 13 2 0 1 6 -99 -99 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1
3 Caucasian 1 40 1 1 7 2 44 1 16 0 0 0 7 -99 -99 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 1 0 1 1
4 Caucasian 1 50 1 1 7 1 51 0 8 0 0 0 5 -99 -99 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 1 0 0 2
5 Caucasian 1 60 2 1 2 3 31 6 16 0 0 0 9 -99 -99 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 1
  • Generating basic statistics (mean, standard deviation, min, max, etc.) for the continuous variables
  • Observed that only 4 of the 23 medicine columns have a mean of 0.1 or more; in the remaining 19, a single value dominates the feature
In [11]:
data.describe().T
Out[11]:
count mean std min 25% 50% 75% max
gender 57678.0 0.466001 0.498847 0.0 0.0 0.0 1.0 1.0
age 57678.0 71.089670 15.529343 10.0 60.0 70.0 80.0 100.0
admission_type_id 57678.0 2.092080 1.506279 1.0 1.0 1.0 3.0 8.0
discharge_disposition_id 57678.0 3.636846 5.278340 1.0 1.0 1.0 3.0 28.0
admission_source_id 57678.0 5.685963 4.152320 1.0 1.0 7.0 7.0 25.0
time_in_hospital 57678.0 4.337702 2.963873 1.0 2.0 4.0 6.0 14.0
num_lab_procedures 57678.0 43.193245 19.952895 1.0 31.0 44.0 57.0 132.0
num_procedures 57678.0 1.431291 1.757343 0.0 0.0 1.0 2.0 6.0
num_medications 57678.0 15.858768 8.261490 1.0 10.0 14.0 20.0 81.0
number_outpatient 57678.0 0.293318 1.077567 0.0 0.0 0.0 0.0 42.0
number_emergency 57678.0 0.113509 0.546149 0.0 0.0 0.0 0.0 42.0
number_inpatient 57678.0 0.230105 0.665467 0.0 0.0 0.0 0.0 12.0
number_diagnoses 57678.0 7.365183 1.879020 3.0 6.0 8.0 9.0 16.0
max_glu_serum 57678.0 -94.136863 21.453710 -99.0 -99.0 -99.0 -99.0 1.0
A1Cresult 57678.0 -81.308454 38.090555 -99.0 -99.0 -99.0 -99.0 1.0
metformin 57678.0 0.207618 0.405605 0.0 0.0 0.0 0.0 1.0
repaglinide 57678.0 0.013731 0.116375 0.0 0.0 0.0 0.0 1.0
nateglinide 57678.0 0.007126 0.084114 0.0 0.0 0.0 0.0 1.0
chlorpropamide 57678.0 0.001040 0.032237 0.0 0.0 0.0 0.0 1.0
glimepiride 57678.0 0.051649 0.221319 0.0 0.0 0.0 0.0 1.0
acetohexamide 57678.0 0.000017 0.004164 0.0 0.0 0.0 0.0 1.0
glipizide 57678.0 0.130240 0.336571 0.0 0.0 0.0 0.0 1.0
glyburide 57678.0 0.108828 0.311426 0.0 0.0 0.0 0.0 1.0
tolbutamide 57678.0 0.000225 0.015011 0.0 0.0 0.0 0.0 1.0
pioglitazone 57678.0 0.074916 0.263258 0.0 0.0 0.0 0.0 1.0
rosiglitazone 57678.0 0.065918 0.248140 0.0 0.0 0.0 0.0 1.0
acarbose 57678.0 0.002913 0.053891 0.0 0.0 0.0 0.0 1.0
miglitol 57678.0 0.000347 0.018618 0.0 0.0 0.0 0.0 1.0
troglitazone 57678.0 0.000035 0.005889 0.0 0.0 0.0 0.0 1.0
tolazamide 57678.0 0.000312 0.017663 0.0 0.0 0.0 0.0 1.0
examide 57678.0 0.000000 0.000000 0.0 0.0 0.0 0.0 0.0
citoglipton 57678.0 0.000000 0.000000 0.0 0.0 0.0 0.0 0.0
insulin 57678.0 0.512431 0.499850 0.0 0.0 1.0 1.0 1.0
glyburide-metformin 57678.0 0.006883 0.082679 0.0 0.0 0.0 0.0 1.0
glipizide-metformin 57678.0 0.000087 0.009310 0.0 0.0 0.0 0.0 1.0
glimepiride-pioglitazone 57678.0 0.000000 0.000000 0.0 0.0 0.0 0.0 0.0
metformin-rosiglitazone 57678.0 0.000000 0.000000 0.0 0.0 0.0 0.0 0.0
metformin-pioglitazone 57678.0 0.000017 0.004164 0.0 0.0 0.0 0.0 1.0
change 57678.0 0.449842 0.497482 0.0 0.0 0.0 1.0 1.0
diabetesMed 57678.0 0.760290 0.426910 0.0 1.0 1.0 1.0 1.0
readmitted 57678.0 0.091699 0.288603 0.0 0.0 0.0 0.0 1.0
numchange 57678.0 0.266393 0.478247 0.0 0.0 0.0 1.0 4.0
nummed 57678.0 1.184334 0.941463 0.0 1.0 1.0 2.0 6.0
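The hardcoded `drugs_drop` list below could also be derived from these statistics. A minimal sketch on toy data (the two columns and their values are illustrative, mirroring the table above, where most binary drug columns have a mean below 0.1):

```python
import pandas as pd

# Toy frame: one commonly prescribed drug and one rare one.
data = pd.DataFrame({
    'insulin':       [1, 0, 1, 1, 0, 1, 0, 1, 1, 0],
    'acetohexamide': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
})

# For 0/1 columns the mean is the share of 1's, so a mean below 0.1
# flags columns dominated by a single value.
means = data.mean()
drugs_drop = means[means < 0.1].index.tolist()
print(drugs_drop)  # ['acetohexamide']
```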
  • Dropping those 19 medicine columns, since they explain little variance and add no meaning to our model
In [12]:
drugs_drop =  ['repaglinide', 'nateglinide', 'chlorpropamide', 'glimepiride', 'acetohexamide', 
        'tolbutamide', 'pioglitazone', 'rosiglitazone', 'acarbose', 'miglitol', 'troglitazone', 'tolazamide', 
        'glyburide-metformin', 'glipizide-metformin', 'glimepiride-pioglitazone', 'metformin-rosiglitazone', 
        'metformin-pioglitazone', 'citoglipton', 'examide']

data.drop(columns=drugs_drop, inplace=True, axis=1)

print(data.shape)
data.head()
(57678, 25)
Out[12]:
race gender age admission_type_id discharge_disposition_id admission_source_id time_in_hospital num_lab_procedures num_procedures num_medications number_outpatient number_emergency number_inpatient number_diagnoses max_glu_serum A1Cresult metformin glipizide glyburide insulin change diabetesMed readmitted numchange nummed
1 Caucasian 0 20 1 1 7 3 59 0 18 0 0 0 9 -99 -99 0 0 0 1 1 1 0 1 1
2 AfricanAmerican 0 30 1 1 7 2 11 5 13 2 0 1 6 -99 -99 0 1 0 0 0 1 0 0 1
3 Caucasian 1 40 1 1 7 2 44 1 16 0 0 0 7 -99 -99 0 0 0 1 1 1 0 1 1
4 Caucasian 1 50 1 1 7 1 51 0 8 0 0 0 5 -99 -99 0 1 0 1 1 1 0 0 2
5 Caucasian 1 60 2 1 2 3 31 6 16 0 0 0 9 -99 -99 0 0 0 1 0 1 0 0 1
  • Checking the datatype of each variable: all are int64 except 'race', which is still an object
In [13]:
data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 57678 entries, 1 to 81413
Data columns (total 25 columns):
race                        57678 non-null object
gender                      57678 non-null int64
age                         57678 non-null int64
admission_type_id           57678 non-null int64
discharge_disposition_id    57678 non-null int64
admission_source_id         57678 non-null int64
time_in_hospital            57678 non-null int64
num_lab_procedures          57678 non-null int64
num_procedures              57678 non-null int64
num_medications             57678 non-null int64
number_outpatient           57678 non-null int64
number_emergency            57678 non-null int64
number_inpatient            57678 non-null int64
number_diagnoses            57678 non-null int64
max_glu_serum               57678 non-null int64
A1Cresult                   57678 non-null int64
metformin                   57678 non-null int64
glipizide                   57678 non-null int64
glyburide                   57678 non-null int64
insulin                     57678 non-null int64
change                      57678 non-null int64
diabetesMed                 57678 non-null int64
readmitted                  57678 non-null int64
numchange                   57678 non-null int64
nummed                      57678 non-null int64
dtypes: int64(24), object(1)
memory usage: 11.4+ MB
  • Creating dummies for the columns race, admission_type_id, discharge_disposition_id and admission_source_id
  • race yields 5 dummies, admission_type 8, discharge_disposition approx. 26 and admission_source approx. 20
  • Dropping each original column from the dataset after concatenating its dummies
In [14]:
race_dummy = pd.get_dummies(data['race'],prefix='race')
admission_type_dummy = pd.get_dummies(data['admission_type_id'],prefix='admission_type')
discharge_disposition_dummy = pd.get_dummies(data['discharge_disposition_id'],prefix='discharge')
admission_source_dummy = pd.get_dummies(data['admission_source_id'],prefix='admission_source')

data = pd.concat([data, race_dummy,admission_type_dummy,discharge_disposition_dummy,admission_source_dummy], axis = 1)

data.drop(columns=['race', 'admission_type_id','discharge_disposition_id', 'admission_source_id'], axis = 1, inplace=True)

print(data.shape)
data.head()
(57678, 77)
Out[14]:
gender age time_in_hospital num_lab_procedures num_procedures num_medications number_outpatient number_emergency number_inpatient number_diagnoses max_glu_serum A1Cresult metformin glipizide glyburide insulin change diabetesMed readmitted numchange nummed race_AfricanAmerican race_Asian race_Caucasian race_Hispanic race_Other admission_type_1 admission_type_2 admission_type_3 admission_type_4 admission_type_5 admission_type_6 admission_type_7 admission_type_8 discharge_1 discharge_2 discharge_3 discharge_4 discharge_5 discharge_6 discharge_7 discharge_8 discharge_9 discharge_10 discharge_11 discharge_12 discharge_13 discharge_14 discharge_15 discharge_16 discharge_17 discharge_18 discharge_19 discharge_20 discharge_22 discharge_23 discharge_24 discharge_25 discharge_27 discharge_28 admission_source_1 admission_source_2 admission_source_3 admission_source_4 admission_source_5 admission_source_6 admission_source_7 admission_source_8 admission_source_9 admission_source_10 admission_source_11 admission_source_13 admission_source_14 admission_source_17 admission_source_20 admission_source_22 admission_source_25
1 0 20 3 59 0 18 0 0 0 9 -99 -99 0 0 0 1 1 1 0 1 1 0 0 1 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0
2 0 30 2 11 5 13 2 0 1 6 -99 -99 0 1 0 0 0 1 0 0 1 1 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0
3 1 40 2 44 1 16 0 0 0 7 -99 -99 0 0 0 1 1 1 0 1 1 0 0 1 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0
4 1 50 1 51 0 8 0 0 0 5 -99 -99 0 1 0 1 1 1 0 0 2 0 0 1 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0
5 1 60 3 31 6 16 0 0 0 9 -99 -99 0 0 0 1 0 1 0 0 1 0 0 1 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

STEP: 3 - Transformation and Outlier Removal

  • Checking whether features are skewed or have high kurtosis, which would distort standardization
  • Performing a log transformation wherever skew or kurtosis falls outside the limits -2 ≤ value ≤ 2; three columns need it
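To illustrate the ±2 cutoff: a heavily zero-inflated column (like `number_emergency` in this dataset) shows skew and kurtosis well beyond the limit. A toy sketch with made-up values:

```python
import pandas as pd

# 95% zeros plus a few large values, mimicking a zero-inflated count column.
s = pd.Series([0] * 95 + [50] * 5)

# Both statistics exceed 2, so this column would be flagged for a log transform.
print(s.skew(), s.kurtosis())
```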
In [15]:
train=data

num_col = ['age', 'time_in_hospital', 'num_lab_procedures',
       'num_procedures', 'num_medications', 'number_outpatient',
       'number_emergency', 'number_inpatient', 'number_diagnoses']

statdataframe = pd.DataFrame()
statdataframe['numeric_column'] = num_col

skew_before = []
skew_after = []

kurt_before = []
kurt_after = []

standard_deviation_before = []
standard_deviation_after = []

log_transform_needed = []

log_type = []

for i in num_col:
    skewval = train[i].skew()
    skew_before.append(skewval)
    
    kurtval = train[i].kurtosis()
    kurt_before.append(kurtval)
    
    sdval = train[i].std()
    standard_deviation_before.append(sdval)
    
    if (abs(skewval) >2) & (abs(kurtval) >2):
        log_transform_needed.append('Yes')
        
        if len(train[train[i] == 0])/len(train) <=0.02:
            log_type.append('log')
            skewvalnew = np.log(pd.DataFrame(train[train[i] > 0])[i]).skew()
            skew_after.append(skewvalnew)
            
            kurtvalnew = np.log(pd.DataFrame(train[train[i] > 0])[i]).kurtosis()
            kurt_after.append(kurtvalnew)
            
            sdvalnew = np.log(pd.DataFrame(train[train[i] > 0])[i]).std()
            standard_deviation_after.append(sdvalnew)
            
        else:
            log_type.append('log1p')
            skewvalnew = np.log1p(pd.DataFrame(train[train[i] >= 0])[i]).skew()
            skew_after.append(skewvalnew)
        
            kurtvalnew = np.log1p(pd.DataFrame(train[train[i] >= 0])[i]).kurtosis()
            kurt_after.append(kurtvalnew)
            
            sdvalnew = np.log1p(pd.DataFrame(train[train[i] >= 0])[i]).std()
            standard_deviation_after.append(sdvalnew)
            
    else:
        log_type.append('NA')
        log_transform_needed.append('No')
        
        skew_after.append(skewval)
        kurt_after.append(kurtval)
        standard_deviation_after.append(sdval)

statdataframe['skew_before'] = skew_before
statdataframe['kurtosis_before'] = kurt_before
statdataframe['standard_deviation_before'] = standard_deviation_before
statdataframe['log_transform_needed'] = log_transform_needed
statdataframe['log_type'] = log_type
statdataframe['skew_after'] = skew_after
statdataframe['kurtosis_after'] = kurt_after
statdataframe['standard_deviation_after'] = standard_deviation_after

statdataframe
Out[15]:
numeric_column skew_before kurtosis_before standard_deviation_before log_transform_needed log_type skew_after kurtosis_after standard_deviation_after
0 age -0.570989 0.166001 15.529343 No NA -0.570989 0.166001 15.529343
1 time_in_hospital 1.156465 0.927537 2.963873 No NA 1.156465 0.927537 2.963873
2 num_lab_procedures -0.219924 -0.288312 19.952895 No NA -0.219924 -0.288312 19.952895
3 num_procedures 1.219148 0.539435 1.757343 No NA 1.219148 0.539435 1.757343
4 num_medications 1.413154 3.742335 8.261490 No NA 1.413154 3.742335 8.261490
5 number_outpatient 8.777192 151.174081 1.077567 Yes log1p 3.062745 9.938887 0.387981
6 number_emergency 21.562148 1165.513287 0.546149 Yes log1p 4.090378 19.893543 0.238866
7 number_inpatient 4.822509 36.413340 0.665467 Yes log1p 2.566101 6.529051 0.330318
8 number_diagnoses -0.699220 -0.617789 1.879020 No NA -0.699220 -0.617789 1.879020
  • Computing log(x) for a feature x when the share of zeros in x is at most 2%: the zero rows are dropped first, and the 2% cap ensures we don't bulk-remove records that hold predictive power for other columns
  • Computing log1p(x) otherwise, where log1p(x) means log(x + 1); this maps 0 to 0 and therefore retains the zero rows
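The reason `log1p` retains zeros while plain `log` cannot, in a quick numpy check:

```python
import numpy as np

x = np.array([0.0, 1.0, 42.0])

# log(0) is undefined (-inf), so plain log requires dropping the zero rows first.
with np.errstate(divide='ignore'):
    print(np.log(x))    # [-inf  0.  3.7377...]

# log1p(x) = log(x + 1) maps 0 -> 0, so zero-heavy columns keep all their rows.
print(np.log1p(x))      # [0.  0.6931...  3.7612...]
```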
In [16]:
for i in range(len(statdataframe)):
    if statdataframe['log_transform_needed'][i] == 'Yes':
        colname = str(statdataframe['numeric_column'][i])
        
        if statdataframe['log_type'][i] == 'log':
            train = train[train[colname] > 0]
            train[colname + "_log"] = np.log(train[colname])
            
        elif statdataframe['log_type'][i] == 'log1p':
            train = train[train[colname] >= 0]
            train[colname + "_log1p"] = np.log1p(train[colname])

train = train.drop(['number_outpatient', 'number_inpatient', 'number_emergency'], axis = 1)

print(train.shape)
train.head()
(57678, 77)
Out[16]:
gender age time_in_hospital num_lab_procedures num_procedures num_medications number_diagnoses max_glu_serum A1Cresult metformin glipizide glyburide insulin change diabetesMed readmitted numchange nummed race_AfricanAmerican race_Asian race_Caucasian race_Hispanic race_Other admission_type_1 admission_type_2 admission_type_3 admission_type_4 admission_type_5 admission_type_6 admission_type_7 admission_type_8 discharge_1 discharge_2 discharge_3 discharge_4 discharge_5 discharge_6 discharge_7 discharge_8 discharge_9 discharge_10 discharge_11 discharge_12 discharge_13 discharge_14 discharge_15 discharge_16 discharge_17 discharge_18 discharge_19 discharge_20 discharge_22 discharge_23 discharge_24 discharge_25 discharge_27 discharge_28 admission_source_1 admission_source_2 admission_source_3 admission_source_4 admission_source_5 admission_source_6 admission_source_7 admission_source_8 admission_source_9 admission_source_10 admission_source_11 admission_source_13 admission_source_14 admission_source_17 admission_source_20 admission_source_22 admission_source_25 number_outpatient_log1p number_emergency_log1p number_inpatient_log1p
1 0 20 3 59 0 18 9 -99 -99 0 0 0 1 1 1 0 1 1 0 0 1 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0.000000 0.0 0.000000
2 0 30 2 11 5 13 6 -99 -99 0 1 0 0 0 1 0 0 1 1 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1.098612 0.0 0.693147
3 1 40 2 44 1 16 7 -99 -99 0 0 0 1 1 1 0 1 1 0 0 1 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0.000000 0.0 0.000000
4 1 50 1 51 0 8 5 -99 -99 0 1 0 1 1 1 0 0 2 0 0 1 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0.000000 0.0 0.000000
5 1 60 3 31 6 16 9 -99 -99 0 0 0 1 0 1 0 0 1 0 0 1 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.000000 0.0 0.000000
  • Values within 3 standard deviations of the mean cover about 99.7% of the data; the remaining 0.3% are treated as outliers
  • Using this rule, restricting each numeric column to within 3 standard deviations of its mean
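The 3-standard-deviation filter can be seen on a toy column (the values here are made up for illustration), using `scipy.stats.zscore` as the notebook does:

```python
import numpy as np
from scipy import stats

# 29 ordinary values and one extreme outlier.
col = np.array([4.0, 5.0, 6.0] * 9 + [5.0, 5.0, 500.0])

# Keep rows whose z-score magnitude is below 3; the 500.0 row is dropped.
mask = np.abs(stats.zscore(col)) < 3
print(col[mask])
```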
In [17]:
num_cols = ['age', 'time_in_hospital', 'num_lab_procedures',
       'num_procedures', 'num_medications', 'number_diagnoses']
train = train[(np.abs(sp.stats.zscore(train[num_cols])) < 3).all(axis=1)]

print(train.shape)
train.head()
(55944, 77)
Out[17]:
gender age time_in_hospital num_lab_procedures num_procedures num_medications number_diagnoses max_glu_serum A1Cresult metformin glipizide glyburide insulin change diabetesMed readmitted numchange nummed race_AfricanAmerican race_Asian race_Caucasian race_Hispanic race_Other admission_type_1 admission_type_2 admission_type_3 admission_type_4 admission_type_5 admission_type_6 admission_type_7 admission_type_8 discharge_1 discharge_2 discharge_3 discharge_4 discharge_5 discharge_6 discharge_7 discharge_8 discharge_9 discharge_10 discharge_11 discharge_12 discharge_13 discharge_14 discharge_15 discharge_16 discharge_17 discharge_18 discharge_19 discharge_20 discharge_22 discharge_23 discharge_24 discharge_25 discharge_27 discharge_28 admission_source_1 admission_source_2 admission_source_3 admission_source_4 admission_source_5 admission_source_6 admission_source_7 admission_source_8 admission_source_9 admission_source_10 admission_source_11 admission_source_13 admission_source_14 admission_source_17 admission_source_20 admission_source_22 admission_source_25 number_outpatient_log1p number_emergency_log1p number_inpatient_log1p
2 0 30 2 11 5 13 6 -99 -99 0 1 0 0 0 1 0 0 1 1 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1.098612 0.0 0.693147
3 1 40 2 44 1 16 7 -99 -99 0 0 0 1 1 1 0 1 1 0 0 1 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0.000000 0.0 0.000000
4 1 50 1 51 0 8 5 -99 -99 0 1 0 1 1 1 0 0 2 0 0 1 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0.000000 0.0 0.000000
5 1 60 3 31 6 16 9 -99 -99 0 0 0 1 0 1 0 0 1 0 0 1 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.000000 0.0 0.000000
6 1 70 4 70 1 21 7 -99 -99 1 0 0 1 1 1 0 0 3 0 0 1 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.000000 0.0 0.000000

STEP: 4 - Exploratory Data Analysis and Sampling

  • Plotting a scatter matrix to check whether variables are related and whether each relationship is positive or negative
In [18]:
attributes=num_cols
scatter_matrix(train[attributes], figsize = (20,15), c = train.readmitted, alpha = 0.8, cmap="Reds", marker = '+')
plt.show()
  • Plotting a heatmap of the correlation matrix to check for positive or negative correlations between variable pairs
In [19]:
plt.figure(figsize=(15,10)) 
sns.heatmap(train[attributes].corr(), annot=True, cmap="Reds") 
plt.show()
  • Displaying a pairplot (histograms plus scatterplots) in order to see each variable's distribution and the relationships between variables
In [20]:
sns.pairplot(train, hue = 'readmitted', vars = num_cols, palette="Reds", markers="+")
Out[20]:
<seaborn.axisgrid.PairGrid at 0x213422db788>
  • Drawing boxplots to see the spread of continuous variables such as age, time in hospital, number of lab procedures, etc.
In [21]:
plt.figure(figsize=(15,10))
sns.boxplot(data = train[num_cols], palette="Reds")
Out[21]:
<matplotlib.axes._subplots.AxesSubplot at 0x213422f4088>
  • Checking the counts of 1's and 0's in the target variable 'readmitted'
  • Found far more 0's than 1's: the dataset is highly imbalanced, which would lead a model to predict the 0's correctly but miss the 1's
In [22]:
sns.countplot(train['readmitted'], label = "Count", palette="Reds")
Out[22]:
<matplotlib.axes._subplots.AxesSubplot at 0x2133de43d48>
  • Separating the dependent variable 'readmitted' from the other, independent variables
In [23]:
train_input=train.drop('readmitted',axis=1)
train_output=train['readmitted']

print(train_input.shape)
print(train_output.shape)
(55944, 76)
(55944,)
  • Since the dataset is highly imbalanced, applying SMOTE oversampling to equalize the number of 0's and 1's
In [24]:
print('Original dataset shape {}'.format(Counter(train_output)))

sm = SMOTE(random_state=20)
train_input_new, train_output_new = sm.fit_resample(train_input, train_output)
print('New dataset shape {}'.format(Counter(train_output_new)))
Original dataset shape Counter({0: 50816, 1: 5128})
New dataset shape Counter({0: 50816, 1: 50816})
In [25]:
sns.countplot(train_output_new, label = "Count", palette="Reds")
Out[25]:
<matplotlib.axes._subplots.AxesSubplot at 0x2133e4ec448>

STEP: 5 - Model Building and Evaluation

  • Splitting the dataset into train and test: 80% of the data is used to train the model and 20% to evaluate it
  • After the split, applying Min-Max scaling to bring all variables onto one scale, so that a variable's magnitude does not influence the model
In [26]:
X_train_unscaled, X_test_unscaled, y_train, y_test = train_test_split(train_input_new,train_output_new,test_size=0.2,random_state=2)

scaler=MinMaxScaler()
X_train=scaler.fit_transform(X_train_unscaled)
X_test=scaler.transform(X_test_unscaled)

print(X_train)
print(X_test)
[[0.         0.42857143 0.16666667 ... 0.         0.         0.        ]
 [0.94064732 0.57142857 0.08827939 ... 0.1733508  0.         0.2541988 ]
 [0.63533566 0.57142857 0.16666667 ... 0.06720357 0.         0.19709244]
 ...
 [0.39486302 0.85714286 0.13376141 ... 0.07276885 0.11533589 0.        ]
 [0.         0.57142857 0.22962775 ... 0.         0.1392363  0.27023815]
 [1.         0.57142857 0.40598802 ... 0.         0.         0.23560883]]
[[0.         0.57142857 0.41666667 ... 0.         0.         0.        ]
 [0.         0.71428571 0.00612223 ... 0.17074973 0.         0.        ]
 [1.         0.28571429 0.33333333 ... 0.         0.         0.        ]
 ...
 [0.         0.42857143 0.25       ... 0.         0.         0.        ]
 [0.49471373 0.42857143 0.74823791 ... 0.         0.09117022 0.27309526]
 [0.51470972 0.42857143 0.16666667 ... 0.         0.         0.        ]]
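Min-Max scaling applies x' = (x − min) / (max − min) per feature, using the training minima and maxima for both sets. A numpy sketch of what `MinMaxScaler` computes (toy values for illustration):

```python
import numpy as np

X_train = np.array([[1.0, 200.0],
                    [3.0, 400.0],
                    [5.0, 600.0]])
X_test = np.array([[2.0, 300.0],
                   [7.0, 700.0]])

# Fit on TRAIN only (what scaler.fit_transform does), then reuse the
# same minima/maxima for TEST (what scaler.transform does).
mn, mx = X_train.min(axis=0), X_train.max(axis=0)
print((X_train - mn) / (mx - mn))  # rows: [0, 0], [0.5, 0.5], [1, 1]
print((X_test - mn) / (mx - mn))   # second row maps above 1: [1.5, 1.25]
```

Test values outside the training range intentionally scale outside [0, 1]; refitting the scaler on test data would leak information.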

1) Logistic Regression :

  • Using Logistic Regression to see the relative impact and statistical significance of each variable on the probability of readmission
  • Tuning the 'penalty' hyperparameter with grid search and 5-fold cross-validation
  • Grid search found penalty='l1' to give the best result; using this parameter to train our model
In [27]:
# liblinear supports both the l1 and l2 penalties being searched
logit = LogisticRegression(solver='liblinear')

param_grid = {'penalty': ['l1', 'l2']}
grid_search = GridSearchCV(logit, param_grid, cv=5, return_train_score=True)
grid_search.fit(X_train, y_train)
grid_search.best_params_
Out[27]:
{'penalty': 'l1'}
  • After training on the best parameters, calculated evaluation metrics such as recall, precision and f1 score to judge the model
  • The model's metrics are low, so we concluded to try other algorithms
In [28]:
results = pd.DataFrame(index=None, columns=['model','f1_score_train','f1_score_test','train_precision_score',
                                            'test_precision_score','train_recall_score','test_recall_score'])

# Fitting the model on the best parameters and printing the results
lreg_clf = LogisticRegression(penalty='l1', solver='liblinear')
lreg_clf.fit(X_train,y_train)
y_lreg_clf = lreg_clf.predict(X_test)
f1_score_train=f1_score(y_train, lreg_clf.predict(X_train))
f1_score_test=f1_score(y_test, lreg_clf.predict(X_test))
train_precision_score=precision_score(y_train,lreg_clf.predict(X_train))
test_precision_score=precision_score(y_test,lreg_clf.predict(X_test))
train_recall_score=recall_score(y_train,lreg_clf.predict(X_train))
test_recall_score=recall_score(y_test,lreg_clf.predict(X_test))
results = results.append(pd.Series({'model':'Logistic Regression','f1_score_train':f1_score_train,'f1_score_test':f1_score_test,
                                    'train_precision_score':train_precision_score,'train_recall_score':train_recall_score,
                                    'test_recall_score':test_recall_score,'test_precision_score':test_precision_score}) 
                         ,ignore_index=True )
results
Out[28]:
model f1_score_train f1_score_test train_precision_score test_precision_score train_recall_score test_recall_score
0 Logistic Regression 0.614488 0.61301 0.651122 0.647388 0.581756 0.5821
  • Plotting the confusion matrix for Logistic Regression, to see the numbers of true positives, true negatives, false positives and false negatives
In [29]:
y_pred=lreg_clf.predict(X_test)

cm=confusion_matrix(y_test,y_pred)
sns.heatmap(cm, annot=True, fmt='g', cmap="Reds")
plt.xlabel('Predicted')
plt.ylabel('Actual')
Out[29]:
Text(34.0, 0.5, 'Actual')

2) Decision Tree :

  • Using a Decision Tree to capture nonlinear effects of features and interactions between variables
  • Tuning the 'max_depth' hyperparameter with grid search and 5-fold cross-validation
  • Grid search found max_depth=12 to give the best result; using this parameter to train our model
  • Calculating the evaluation metrics after training, in order to judge the model
  • We found that the Decision Tree model has good evaluation metrics overall
In [30]:
from sklearn.tree import DecisionTreeClassifier
dt_clf = DecisionTreeClassifier(random_state=10)
param_grid = {'max_depth': [5,6,7, 8,10,12,15, 20, 50, 100]}

grid_search = GridSearchCV(dt_clf, param_grid, cv = 5, return_train_score=True)
grid_search.fit(X_train, y_train)
grid_search.best_params_

dt_clf = DecisionTreeClassifier(max_depth = 12)
dt_clf.fit(X_train,y_train)
y_dt_clf = dt_clf.predict(X_test)
train_precision_score=precision_score(y_train,dt_clf.predict(X_train))
test_precision_score=precision_score(y_test,dt_clf.predict(X_test))
f1_score_train=f1_score(y_train, dt_clf.predict(X_train))
f1_score_test=f1_score(y_test, dt_clf.predict(X_test))
train_recall_score=recall_score(y_train,dt_clf.predict(X_train))
test_recall_score=recall_score(y_test,dt_clf.predict(X_test))
results = results.append(pd.Series({'model':'Decision Tree','f1_score_train':f1_score_train,'f1_score_test':f1_score_test,
                                    'train_precision_score':train_precision_score,
                                    'train_recall_score':train_recall_score,
                                    'test_recall_score':test_recall_score,'test_precision_score':test_precision_score})
                         ,ignore_index=True )
results
Out[30]:
model f1_score_train f1_score_test train_precision_score test_precision_score train_recall_score test_recall_score
0 Logistic Regression 0.614488 0.613010 0.651122 0.647388 0.581756 0.582100
1 Decision Tree 0.929367 0.920368 0.995285 0.984704 0.871639 0.863923
  • Displaying the confusion matrix to count the model's correct and incorrect readmission predictions
In [31]:
y_pred=dt_clf.predict(X_test)

cm=confusion_matrix(y_test,y_pred)
sns.heatmap(cm, annot=True, fmt='g', cmap="Reds")
plt.xlabel('Predicted')
plt.ylabel('Actual')
Out[31]:
Text(34.0, 0.5, 'Actual')

3) Random Forest :

  • Since a single tree improved the evaluation metrics so much, using an ensemble of decision trees built on random feature subsets: a Random Forest
  • Tuning 'max_depth', 'max_features', 'min_samples_split', 'min_samples_leaf' and 'bootstrap' with randomized search over 20 iterations
  • Random search found bootstrap=False, max_depth=8, max_features=22, min_samples_leaf=1, min_samples_split=5 to give the best result; using these parameters to train our model
  • After training on the best parameters, evaluation metrics such as recall, precision and f1 score were calculated to judge the model
  • The model has the best evaluation metrics of the three algorithms, so we concluded Random Forest is the go-to algorithm for deployment
In [32]:
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint as sp_randint

param_grid = {"max_depth": [3, 5, 6,8],
              "max_features": sp_randint(1, 25),
              "min_samples_split": sp_randint(2, 30),
              "min_samples_leaf": sp_randint(1, 20),
              "bootstrap": [True, False]}
random_search = RandomizedSearchCV(RandomForestClassifier(n_estimators=1000), param_distributions=param_grid,
                                   n_iter=20, random_state=0,n_jobs=-1, return_train_score=True)
random_search.fit(X_train, y_train)
random_search.best_params_

rf_clf = RandomForestClassifier(bootstrap=False,max_depth=8,max_features=22,min_samples_leaf=1,min_samples_split=5)
rf_clf.fit(X_train,y_train)
y_rf_clf = rf_clf.predict(X_test)
train_precision_score=precision_score(y_train,rf_clf.predict(X_train))
test_precision_score=precision_score(y_test,rf_clf.predict(X_test))
f1_score_train=f1_score(y_train, rf_clf.predict(X_train))
f1_score_test=f1_score(y_test, rf_clf.predict(X_test))
train_recall_score=recall_score(y_train,rf_clf.predict(X_train))
test_recall_score=recall_score(y_test,rf_clf.predict(X_test))
results = results.append(pd.Series({'model':'Random Forest','f1_score_train':f1_score_train,'f1_score_test':f1_score_test,
                                    'train_precision_score':train_precision_score,'train_recall_score':train_recall_score,
                                    'test_recall_score':test_recall_score,'test_precision_score':test_precision_score})
                         ,ignore_index=True )
results
Out[32]:
model f1_score_train f1_score_test train_precision_score test_precision_score train_recall_score test_recall_score
0 Logistic Regression 0.614488 0.613010 0.651122 0.647388 0.581756 0.582100
1 Decision Tree 0.929367 0.920368 0.995285 0.984704 0.871639 0.863923
2 Random Forest 0.925424 0.922825 0.993431 0.991386 0.866132 0.863134
  • Plotting the actual versus predicted counts of 1's and 0's for the Random Forest model
In [33]:
y_pred=rf_clf.predict(X_test)

cm=confusion_matrix(y_test,y_pred)
sns.heatmap(cm, annot=True, fmt='g', cmap="Reds")
plt.xlabel('Predicted')
plt.ylabel('Actual')
Out[33]:
Text(34.0, 0.5, 'Actual')

STEP: 6 - Best Model and Deployment

  • Comparing the 3 models on the f1_score evaluation metric
  • Random Forest gives the best result on the basis of train and test f1_score
In [34]:
cols = ['model','f1_score_train','f1_score_test']
results[cols].set_index('model').plot(kind = 'bar', figsize=(15,8), cmap='Reds')
plt.title('Train and Test f1_score')
Out[34]:
Text(0.5, 1.0, 'Train and Test f1_score')
  • Deploying the Random Forest model and the min-max scaler by dumping them to pickle files
  • Once the pickle files are created, one sample is passed through to get the prediction and its probability
  • The first value is the probability of not being readmitted; the second, of being readmitted
In [35]:
import pickle

pickle.dump(scaler, open('tranform.pkl','wb'))
pickle.dump(rf_clf, open('model.pkl','wb'))

X_test = scaler.transform(X_test_unscaled[:1])  # scale one raw sample with the fitted scaler

predictions=rf_clf.predict(X_test)
print("Predicted Result : ",predictions)

predictions = rf_clf.predict_proba(X_test)
print("Predicted Result probability : ",predictions)
Predicted Result :  [0]
Predicted Result probability :  [[0.7263008 0.2736992]]
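A sketch of the serving side that the deployment relies on: loading the dumped pickle files back and scoring one raw sample. The stand-in scaler and model here are trained on random 6-feature data purely so the round trip is self-contained; in the notebook the fitted `scaler` and `rf_clf` are what get dumped.

```python
import pickle
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.ensemble import RandomForestClassifier

# Stand-in artifacts so the round trip runs end to end
rng = np.random.RandomState(0)
X, y = rng.rand(50, 6), rng.randint(0, 2, 50)
scaler = MinMaxScaler().fit(X)
model = RandomForestClassifier(n_estimators=10, random_state=0).fit(scaler.transform(X), y)

pickle.dump(scaler, open('tranform.pkl', 'wb'))  # file name kept as in the notebook
pickle.dump(model, open('model.pkl', 'wb'))

# Serving side: load the artifacts and score one raw sample
scaler_loaded = pickle.load(open('tranform.pkl', 'rb'))
model_loaded = pickle.load(open('model.pkl', 'rb'))
proba = model_loaded.predict_proba(scaler_loaded.transform(X[:1]))
print(proba)  # [[P(not readmitted), P(readmitted)]]
```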
  • Built an HTML interface for a complete product, mapping the deployed pickle files to 6 prominent features
  • The interface takes Discharge to home, Insulin, Gender, Metformin, and Change in medicine as binary inputs, and Number of medicines as a value from 1 to 23
  • Below is an image of the product demo with sample inputs and the resulting readmission probability
In [36]:
from IPython.display import Image
Image(filename=r'interface.png',width=1000, height=60)
Out[36]:

STEP: 7 - Interpretations and Insights

  • Interpreting statistics from the models and drawing insights from them
  • Running Logistic Regression to find the weight (influence) of each variable
  • From the Random Forest, obtaining the feature importances and plotting them as a barplot
In [37]:
import plotly.graph_objs as go
import plotly.offline as py

features = train_input.columns.values

# Sort (importance, feature) pairs ascending so the horizontal bars read smallest to largest
x, y = (list(t) for t in zip(*sorted(zip(rf_clf.feature_importances_, features))))
trace2 = go.Bar(
    x=x ,
    y=y,
    marker=dict(
        color=x,
        colorscale = 'Viridis',
        reversescale = True
    ),
    name='Random Forest Feature importance',
    orientation='h',
)

layout = dict(
    title='Barplot of Feature importances',
     width = 900, height = 2000,
    yaxis=dict(
        showgrid=False,
        showline=False,
        showticklabels=True,
#         domain=[0, 0.85],
    ),
    margin=dict(
    l=300,
),
)
fig1 = go.Figure(data=[trace2])
fig1['layout'].update(layout)

py.plot(fig1, filename='plots')

from IPython.display import Image
Image(filename=r'newplot.png',width=500, height=30)
Out[37]:
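The same ranking can be pulled out without Plotly as a sorted pandas Series, which is convenient for tabular inspection. A minimal sketch with a stand-in model and illustrative feature names (the notebook uses `rf_clf` and `train_input.columns` instead):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Stand-in data: the first feature fully determines the label
rng = np.random.RandomState(42)
X = rng.rand(200, 4)
y = (X[:, 0] > 0.5).astype(int)
features = ['discharge_to_home', 'insulin', 'num_medications', 'gender']  # illustrative names

clf = RandomForestClassifier(n_estimators=50, random_state=42).fit(X, y)
importances = pd.Series(clf.feature_importances_, index=features).sort_values(ascending=False)
print(importances)
```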

INSIGHTS for Readmission :

- 1) Every 15-year increase in age raises the odds of readmission by 23%
- 2) Discharge type (discharge to home) is the most prominent factor in classifying readmission of a diabetic patient
- 3) Out of the 23 medicines, use of Insulin and Metformin increases the chance of readmission
- 4) Caucasian patients are more likely to be readmitted to hospital
- 5) Diabetic males have higher odds of readmission than females
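Insight 1) is the kind of statement read off a logistic-regression coefficient: exponentiating a coefficient gives the multiplicative change in odds per unit increase of the feature. A minimal sketch of that conversion, using an illustrative coefficient chosen to match the 23% figure (not the notebook's fitted value):

```python
import math

# If b is the fitted coefficient for a 15-year age band,
# exp(b) is the odds ratio per one-band (15-year) increase
b_age = math.log(1.23)            # illustrative coefficient implying a 23% odds increase
odds_ratio = math.exp(b_age)
print(f"odds ratio per 15-year band: {odds_ratio:.2f}")

# Over two bands (30 years) the odds multiply: 1.23 * 1.23
print(f"over 30 years: {odds_ratio**2:.2f}")
```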


In [38]:
from IPython.display import Image
Image(filename=r'insights.png',width=1000, height=60)
Out[38]:

STEP: 8 - Improvements and Future Work

IMPROVEMENTS in project :

- 1) Using more features, such as diag_1, diag_2, and diag_3, to build the machine learning model
- 2) Adding more prominent features as inputs on the product interface
- 3) The available dataset covers only 1999-2008; more recent data could be obtained to build the model
- 4) Since the dataset was highly imbalanced, we balanced it only with an oversampling technique; an undersampling technique could also be tried to improve the model
- 5) Employing more machine learning algorithms and checking for improvement; we used only 3 (Logistic Regression, Decision Tree, and Random Forest)


In [39]:
from IPython.display import Image
Image(filename=r'Improvements.png',width=1000, height=60)
Out[39]: